Import the data and packages

The Data

The data contains three main columns:

  1. Title: the title of the news tweet.
  2. Text: the actual text of the news tweet.
  3. Label: the label of the news tweet, either Fake or Real.

For the purposes of this analysis, I will focus only on the text and label columns.

Preprocessing

First, I clean the data by applying a preprocessing function that performs the following steps:

  1. Converts all the text to lower case.
  2. Removes all URL links, non-alphanumeric characters, and Twitter handles from the text.
  3. Tokenizes the text by splitting on whitespace and stores the tokens in a word list.
  4. Removes all stop words from the word list. Stop words are common words that appear very frequently in the text, such as "and", "so", and "the". They are necessary for sentence construction (they form the connective parts of speech) but add no additional meaning, so dropping them reduces storage and speeds up model training and fine-tuning.
  5. Finally, joins the "purified" word list back into text and appends it to the original data frame as the "cleaned_text" column.
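The steps above can be sketched as a single cleaning function. This is a minimal illustration, not the notebook's exact code: the stop-word set here is a tiny hand-picked example (the real function would presumably use a full list such as NLTK's), and the regex patterns are assumptions about how URLs and handles were stripped.

```python
import re

# Illustrative stop-word set; a real run would use a full list,
# e.g. nltk.corpus.stopwords.words("english").
STOP_WORDS = {"and", "so", "the", "a", "an", "of", "to", "in", "is", "it"}

def clean_text(text):
    """Lower-case, strip URLs/handles/non-alphanumerics, tokenize, drop stop words."""
    text = text.lower()                                # step 1: lower case
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # step 2: URLs
    text = re.sub(r"@\w+", "", text)                   # step 2: Twitter handles
    text = re.sub(r"[^a-z0-9\s]", " ", text)           # step 2: non-alphanumerics
    words = text.split()                               # step 3: tokenize on whitespace
    words = [w for w in words if w not in STOP_WORDS]  # step 4: remove stop words
    return " ".join(words)                             # step 5: back to one string

print(clean_text("Check this out @user https://t.co/abc123 The BEST news!"))
# → check this out best news
```

The cleaned strings can then be assigned to a new data frame column, e.g. `df["cleaned_text"] = df["text"].apply(clean_text)`.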

Exploratory Data Analysis

The data is fairly balanced between the two labels, so accuracy will be used as the main metric to evaluate performance.
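A quick balance check might look like the following (toy labels standing in for the real column):

```python
import pandas as pd

# Toy stand-in; the notebook would check df["label"].value_counts() on the real data.
df = pd.DataFrame({"label": ["Fake", "Real", "Real", "Fake", "Fake", "Real"]})

proportions = df["label"].value_counts(normalize=True)
print(proportions)  # both classes at 0.5 in this toy example
```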

Sentiment Analysis

Let's take a look at the average polarity and subjectivity scores for each label, as well as the overall polarity and subjectivity of the entire dataset.

The average sentiment of the data is fairly neutral for each label, i.e. the difference in average polarity between Fake and Real news is relatively small. The average subjectivity scores are also quite close for each label, leaning toward objective news (assuming that 0.5 is the threshold between subjective and objective).

Plotting the distribution of Polarity and Subjectivity Scores
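A histogram along these lines can be sketched with matplotlib (synthetic scores stand in for the real columns here):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Synthetic polarity scores standing in for df["polarity"];
# the notebook plots the real columns, split by label.
rng = np.random.default_rng(0)
polarity = rng.normal(loc=0.05, scale=0.2, size=500).clip(-1, 1)

counts, bins, _ = plt.hist(polarity, bins=30)
plt.xlabel("Polarity")
plt.ylabel("Frequency")
plt.title("Distribution of polarity scores")
plt.savefig("polarity_hist.png")
print(int(counts.sum()))  # every score lands in some bin
```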

From the above, the majority of tweets (for both labels) lie toward the positive end of the spectrum, although still close to neutral.

The subjectivity scores are more evenly spread around the mean (the data is slightly subjective), suggesting that most of the tweet text is opinion.

Let's take a look at the distribution by label:

Taking a look at word frequency...

Looking at the top 20 words across each target label:

The most frequent word across the entire dataset is "said", which is more prominent across the REAL news tweets, whilst the most frequent word across FAKE news is "US". The word "Trump" is also quite prominent for both target labels.
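Frequency counts like these can be reproduced with a `Counter` over the cleaned text (toy documents shown here in place of the real `cleaned_text` column):

```python
from collections import Counter

# Toy cleaned documents; the notebook would iterate over df["cleaned_text"].
cleaned_docs = ["said trump us election", "us trump said rally", "said vote"]

word_counts = Counter(word for doc in cleaned_docs for word in doc.split())
print(word_counts.most_common(3))  # → [('said', 3), ('trump', 2), ('us', 2)]
```

Swapping `most_common(3)` for `most_common(20)` gives the top-20 view, and filtering the documents by label first gives the per-label counts.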

Data Modelling

Train Test Split

Initialize the TF-IDF vectorizer and apply it to the training examples
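A minimal TF-IDF setup along these lines might look as follows; the corpus, split ratio, and `max_df` value are illustrative, not necessarily the notebook's.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the cleaned text and labels.
texts = ["fake news spreads fast", "real news said officials",
         "fake news hoax claim", "real news report statement"]
labels = ["FAKE", "REAL", "FAKE", "REAL"]

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

# max_df=0.7 drops terms that appear in more than 70% of training documents.
vectorizer = TfidfVectorizer(max_df=0.7)
tfidf_train = vectorizer.fit_transform(x_train)  # fit on training data only
tfidf_test = vectorizer.transform(x_test)        # reuse the fitted vocabulary
print(tfidf_train.shape, tfidf_test.shape)
```

Fitting only on the training split and reusing the learned vocabulary for the test split avoids leaking test-set statistics into the features.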

Using Different Classifiers

We first train various classifiers on the data to see which model trains the fastest and yields the best accuracy. The classifiers used are Passive-Aggressive, Logistic Regression, Random Forest, SVC, and Multinomial Naive Bayes.
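The comparison loop can be sketched as follows; the corpus is a tiny toy stand-in and the hyperparameters are illustrative defaults, not the notebook's tuned values.

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy corpus standing in for the vectorized training data.
texts = ["fake story shocking", "real report said", "fake shocking claim",
         "real officials said", "fake hoax shocking", "real said statement"]
labels = ["FAKE", "REAL", "FAKE", "REAL", "FAKE", "REAL"]

X = TfidfVectorizer().fit_transform(texts)

classifiers = {
    "PassiveAggressive": PassiveAggressiveClassifier(max_iter=50, random_state=0),
    "LogisticRegression": LogisticRegression(),
    "RandomForest": RandomForestClassifier(n_estimators=10, random_state=0),
    "SVC": SVC(),
    "MultinomialNB": MultinomialNB(),
}

for name, clf in classifiers.items():
    start = time.time()
    clf.fit(X, labels)                     # time only the fit step
    elapsed = time.time() - start
    acc = accuracy_score(labels, clf.predict(X))
    print(f"{name}: accuracy={acc:.2f}, fit time={elapsed:.4f}s")
```

In the notebook the accuracy would of course be measured on the held-out test split rather than the training data shown here.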

Model Evaluation

From the above, SVC has the best accuracy score, followed by the Passive-Aggressive model; however, SVC takes the most time to fit. Despite its accuracy, SVC may be difficult to scale to larger datasets. Let's also take a look at the F1 scores.

Based on the classification report above, the SVC and Passive-Aggressive classifiers have the best F1-scores, but since SVC takes longer to train, Passive-Aggressive may be the better predictor for newer, larger datasets. Passive-Aggressive is also much better suited to bigger, more transitory data (such as online data from social media) where there is a constant stream of incoming examples.
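For reference, the per-class F1 comparison comes from scikit-learn's `classification_report` / `f1_score`; the labels below are hypothetical, for illustration only.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical true and predicted labels.
y_true = ["FAKE", "REAL", "FAKE", "REAL", "FAKE"]
y_pred = ["FAKE", "REAL", "REAL", "REAL", "FAKE"]

print(classification_report(y_true, y_pred))
# F1 for one class: harmonic mean of precision (2/2) and recall (2/3) here.
print(f1_score(y_true, y_pred, pos_label="FAKE"))  # → 0.8
```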

Model Interpretation

Let's evaluate why the Passive-Aggressive model predicts the way it does using LIME. LIME is a surrogate interpretability model that can be applied to any black-box model (model-agnostic) and explains a single observation's prediction (local). For more info on LIME, refer to this Medium article: https://medium.com/@kalia_65609/interpreting-an-nlp-model-with-lime-and-shap-834ccfa124e4. The code below was adapted from the article. We continue with the Passive-Aggressive model, to which I have added a .pa_proba method (for converting confidence values into probabilities) so that it is readily usable in the model pipeline.
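The .pa_proba method itself is not shown here, but one plausible sketch maps the classifier's decision_function margins through a sigmoid; the class name, method name, and demo data below are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

class PAWithProba(PassiveAggressiveClassifier):
    """Hypothetical wrapper: PassiveAggressiveClassifier has no predict_proba,
    so this squashes decision_function margins into pseudo-probabilities."""

    def predict_proba(self, X):
        scores = self.decision_function(X)     # signed distances to the hyperplane
        p_pos = 1.0 / (1.0 + np.exp(-scores))  # sigmoid maps margins into (0, 1)
        return np.column_stack([1.0 - p_pos, p_pos])

# Tiny numeric demo; in the notebook X would be the TF-IDF matrix.
X = np.array([[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.1]])
y = np.array([0, 1, 0, 1])

model = PAWithProba(max_iter=100, random_state=0).fit(X, y)
probs = model.predict_proba(X)
print(probs.shape)  # (4, 2), each row summing to 1
```

Note these are uncalibrated pseudo-probabilities: good enough for LIME's weighting, but not calibrated confidence estimates.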

LIME takes a pipeline as input. We first fit the pipeline on the training data and save it to the variable "model". Since LIME only provides local interpretability, I apply it to a list of random indices from the x_test vector.

For the indices we have chosen, the visualisations show the following: